The palmerpenguins data contains size measurements for
three penguin species observed on three islands in the Palmer
Archipelago, Antarctica.
These data were collected from 2007 to 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. The data were imported directly from the Environmental Data Initiative (EDI) Data Portal, and are available under a CC0 license (“No Rights Reserved”) in accordance with the Palmer Station Data Policy.
Refresher… how do we install and load a package?
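install.packages("palmerpenguins") # only needed once per R installation
library(palmerpenguins)
library(tidyverse) # provides glimpse(), %>%, the dplyr verbs, and ggplot() used below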
This package contains two datasets: penguins and penguins_raw.
Here, we’ll focus on a curated subset of the raw data in the
package named penguins.
The raw data, accessed from the Environmental Data Initiative (see
full data citations below), is also available as
palmerpenguins::penguins_raw.
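As a quick, non-authoritative check of how the two tables relate, the sketch below compares their dimensions and lists the raw column names. It only uses base R functions and the two datasets named above.

```r
library(palmerpenguins)

# Both tables have one row per penguin; the raw version keeps more columns
dim(penguins)
dim(penguins_raw)

# The raw column names keep their original field labels
names(penguins_raw)
```

Looking at names(penguins_raw) is a simple way to see which fields were dropped or renamed in the curated version.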
The curated palmerpenguins::penguins dataset contains 8
variables (n = 344 penguins). You can read more about the variables by
typing ?penguins.
glimpse(penguins)
#> Rows: 344
#> Columns: 8
#> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex <fct> male, female, female, NA, female, male, female, male…
#> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
The palmerpenguins::penguins data contains 333 complete
cases, with 19 missing values.
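One way to check these counts yourself is sketched below with base R helpers. The object names (n_complete, n_missing, missing_by_column) are made up for this illustration.

```r
library(palmerpenguins)

# Rows with no missing value in any column
n_complete <- sum(complete.cases(penguins))

# Total number of missing cells across the whole table
n_missing <- sum(is.na(penguins))

# Number of missing cells in each column
missing_by_column <- colSums(is.na(penguins))

n_complete
n_missing
missing_by_column
```

The per-column breakdown is handy for deciding whether to drop whole rows or to pass na.rm = TRUE to individual summary functions.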
Let’s find the smallest penguin (by body mass) observed in each species
(hint: you can use the min() function).
penguins %>%
group_by(species) %>%
filter(body_mass_g == min(body_mass_g, na.rm = TRUE))
#> # A tibble: 4 × 8
#> # Groups: species [3]
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Biscoe 36.5 16.6 181 2850
#> 2 Adelie Biscoe 36.4 17.1 184 2850
#> 3 Gentoo Biscoe 42.7 13.7 208 3950
#> 4 Chinstrap Dream 46.9 16.6 192 2700
#> # ℹ 2 more variables: sex <fct>, year <int>
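Another possible approach, shown here only as a sketch, is dplyr's slice_min(), which picks the rows with the smallest value in each group. It assumes dplyr 1.0.0 or later; the name smallest is made up for this example.

```r
library(dplyr)
library(palmerpenguins)

smallest <- penguins %>%
  group_by(species) %>%
  # with_ties = TRUE (the default) keeps every row tied for the minimum
  slice_min(body_mass_g, n = 1, with_ties = TRUE) %>%
  ungroup()

smallest
```

Unlike the filter() version, the rows come back ordered by group, so the species appear in factor-level order.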
Practice mutating! Let’s create a new column that has bill size (area, in square millimeters).
The culmen is the upper ridge of a bird’s bill. In the simplified
penguins data, culmen length and depth are renamed as
variables bill_length_mm and bill_depth_mm to
be more intuitive. For this penguin data, the culmen (bill) length and
depth are measured as shown below (thanks Kristen Gorman for
clarifying!):
penguins %>%
mutate(bill_size_mm2 = bill_depth_mm * bill_length_mm) %>%
head()
#> # A tibble: 6 × 9
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torgersen 39.1 18.7 181 3750
#> 2 Adelie Torgersen 39.5 17.4 186 3800
#> 3 Adelie Torgersen 40.3 18 195 3250
#> 4 Adelie Torgersen NA NA NA NA
#> 5 Adelie Torgersen 36.7 19.3 193 3450
#> 6 Adelie Torgersen 39.3 20.6 190 3650
#> # ℹ 3 more variables: sex <fct>, year <int>, bill_size_mm2 <dbl>
Let’s select all columns that contain measurements in mm.
One possible solution is…
penguins %>%
select(ends_with("mm"))
#> # A tibble: 344 × 3
#> bill_length_mm bill_depth_mm flipper_length_mm
#> <dbl> <dbl> <int>
#> 1 39.1 18.7 181
#> 2 39.5 17.4 186
#> 3 40.3 18 195
#> 4 NA NA NA
#> 5 36.7 19.3 193
#> 6 39.3 20.6 190
#> 7 38.9 17.8 181
#> 8 39.2 19.6 195
#> 9 34.1 18.1 193
#> 10 42 20.2 190
#> # ℹ 334 more rows
Another possible solution is…
penguins %>%
select(contains("mm"))
#> # A tibble: 344 × 3
#> bill_length_mm bill_depth_mm flipper_length_mm
#> <dbl> <dbl> <int>
#> 1 39.1 18.7 181
#> 2 39.5 17.4 186
#> 3 40.3 18 195
#> 4 NA NA NA
#> 5 36.7 19.3 193
#> 6 39.3 20.6 190
#> 7 38.9 17.8 181
#> 8 39.2 19.6 195
#> 9 34.1 18.1 193
#> 10 42 20.2 190
#> # ℹ 334 more rows
Let’s find the median body mass for each species.
One possible solution… (using mutate()).
penguins %>%
remove_missing() %>%
group_by(species) %>%
mutate(body_mass_median = median(body_mass_g))
#> # A tibble: 333 × 9
#> # Groups: species [3]
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torgersen 39.1 18.7 181 3750
#> 2 Adelie Torgersen 39.5 17.4 186 3800
#> 3 Adelie Torgersen 40.3 18 195 3250
#> 4 Adelie Torgersen 36.7 19.3 193 3450
#> 5 Adelie Torgersen 39.3 20.6 190 3650
#> 6 Adelie Torgersen 38.9 17.8 181 3625
#> 7 Adelie Torgersen 39.2 19.6 195 4675
#> 8 Adelie Torgersen 41.1 17.6 182 3200
#> 9 Adelie Torgersen 38.6 21.2 191 3800
#> 10 Adelie Torgersen 34.6 21.1 198 4400
#> # ℹ 323 more rows
#> # ℹ 3 more variables: sex <fct>, year <int>, body_mass_median <dbl>
Another possible solution… (using summarize()).
penguins %>%
remove_missing() %>%
group_by(species) %>%
summarize(body_mass_median = median(body_mass_g))
#> # A tibble: 3 × 2
#> species body_mass_median
#> <fct> <dbl>
#> 1 Adelie 3700
#> 2 Chinstrap 3700
#> 3 Gentoo 5050
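Note that remove_missing() (a ggplot2 helper) drops every row with a missing value in any column, including sex, which is why only 333 rows remain above. A hedged alternative is to keep all rows and ignore missing values only in the column being summarized. The names medians_na_rm and n_measured are made up for this sketch.

```r
library(dplyr)
library(palmerpenguins)

medians_na_rm <- penguins %>%
  group_by(species) %>%
  summarize(
    # number of penguins that actually have a body mass recorded
    n_measured = sum(!is.na(body_mass_g)),
    # median over the recorded values only
    body_mass_median = median(body_mass_g, na.rm = TRUE)
  )

medians_na_rm
```

Because this version uses more penguins, its medians are not guaranteed to match the complete-case results.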
Let’s find the median of every numeric variable! This time we also group by year.
penguins %>%
remove_missing() %>%
group_by(species, year) %>%
summarize(across(where(is.numeric), median))
#> # A tibble: 9 × 6
#> # Groups: species [3]
#> species year bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 Adelie 2007 39 18.6 186 3675
#> 2 Adelie 2008 38.6 18.3 190 3700
#> 3 Adelie 2009 38.7 18.0 191 3600
#> 4 Chinstrap 2007 48.8 18.2 194. 3700
#> 5 Chinstrap 2008 49.2 18.5 198. 3750
#> 6 Chinstrap 2009 50.0 18.6 198 3675
#> 7 Gentoo 2007 46.7 14.6 215 5050
#> 8 Gentoo 2008 46.4 15 219 5000
#> 9 Gentoo 2009 48.8 15.2 218 5200
Let’s create a new column that classifies bill size into two categories – big or small (given a threshold of 800 square millimeters).
threshold <- 800 ### first define a threshold to distinguish big from small
penguins %>%
mutate(bill_size_mm2 = bill_depth_mm * bill_length_mm,
bill_size_binary = ifelse(bill_size_mm2 > threshold, "big", "small")) %>%
select(bill_size_binary, bill_size_mm2, everything()) %>%
head()
#> # A tibble: 6 × 10
#> bill_size_binary bill_size_mm2 species island bill_length_mm bill_depth_mm
#> <chr> <dbl> <fct> <fct> <dbl> <dbl>
#> 1 small 731. Adelie Torgersen 39.1 18.7
#> 2 small 687. Adelie Torgersen 39.5 17.4
#> 3 small 725. Adelie Torgersen 40.3 18
#> 4 <NA> NA Adelie Torgersen NA NA
#> 5 small 708. Adelie Torgersen 36.7 19.3
#> 6 big 810. Adelie Torgersen 39.3 20.6
#> # ℹ 4 more variables: flipper_length_mm <int>, body_mass_g <int>, sex <fct>,
#> # year <int>
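The same classification can be sketched with dplyr's case_when(), which scales better if you later want more than two categories. One caveat: a missing value compared with the threshold gives NA, which case_when() treats as "no match", so it would fall through to the catch-all branch. This sketch therefore handles NA explicitly first. The name bill_classes is made up for the example.

```r
library(dplyr)
library(palmerpenguins)

threshold <- 800 # same cut-off as above, in square millimeters

bill_classes <- penguins %>%
  mutate(
    bill_size_mm2 = bill_depth_mm * bill_length_mm,
    bill_size_binary = case_when(
      is.na(bill_size_mm2) ~ NA_character_, # keep missing bills as missing
      bill_size_mm2 > threshold ~ "big",
      TRUE ~ "small" # everything else
    )
  )

head(bill_classes$bill_size_binary)
```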
Let’s use the tidyselect function where()
to select all the columns that are factors.
penguins %>%
dplyr::select(where(is.factor)) %>% ## note that the where() function is a tidyselect function
glimpse()
#> Rows: 344
#> Columns: 3
#> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie…
#> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgers…
#> $ sex <fct> male, female, female, NA, female, male, female, male, NA, NA, …
Use the count() function to count the number of penguins
of each species on each island.
# Count penguins for each species / island
penguins %>%
count(species, island, .drop = FALSE)
#> # A tibble: 9 × 3
#> species island n
#> <fct> <fct> <int>
#> 1 Adelie Biscoe 44
#> 2 Adelie Dream 56
#> 3 Adelie Torgersen 52
#> 4 Chinstrap Biscoe 0
#> 5 Chinstrap Dream 68
#> 6 Chinstrap Torgersen 0
#> 7 Gentoo Biscoe 124
#> 8 Gentoo Dream 0
#> 9 Gentoo Torgersen 0
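To see what .drop = FALSE is doing, the sketch below runs the same count without it, and then writes out the group_by() plus summarize() pipeline that count() abbreviates. The object names are made up for this illustration.

```r
library(dplyr)
library(palmerpenguins)

# Without .drop = FALSE, species/island combinations with no penguins are omitted
observed_only <- penguins %>%
  count(species, island)

# count() is shorthand for grouping and then counting rows with n()
by_hand <- penguins %>%
  group_by(species, island) %>%
  summarize(n = n(), .groups = "drop")

observed_only
by_hand
```

Keeping the zero rows is mostly useful for tables and plots where every combination should appear, even when its count is zero.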
Use ggplot() to make a graph that shows the number of
each species of penguin on each island (hint: use
geom_bar(), and fill = species).
ggplot(penguins, aes(x = island, fill = species)) +
geom_bar(alpha = 0.8, stat = "count") +
scale_fill_manual(values = c("darkorange","purple","cyan4"),
guide = "none") + # guide = FALSE is deprecated in recent ggplot2 versions
theme_minimal() +
facet_wrap(~species, ncol = 1) +
coord_flip()
Use the count() function to count the number of penguins
of each sex and species.
# Count penguins for each species / sex
penguins %>%
count(species, sex, .drop = FALSE)
#> # A tibble: 8 × 3
#> species sex n
#> <fct> <fct> <int>
#> 1 Adelie female 73
#> 2 Adelie male 73
#> 3 Adelie <NA> 6
#> 4 Chinstrap female 34
#> 5 Chinstrap male 34
#> 6 Gentoo female 58
#> 7 Gentoo male 61
#> 8 Gentoo <NA> 5
Use ggplot() to make a graph that shows the number of
penguins of each sex for each species.
ggplot(penguins, aes(x = sex, fill = species)) +
geom_bar(alpha = 0.8) +
scale_fill_manual(values = c("darkorange","purple","cyan4"),
guide = "none") + # guide = FALSE is deprecated in recent ggplot2 versions
theme_minimal() +
facet_wrap(~species, ncol = 1) +
coord_flip()
Use across() and where() to calculate the
mean of all the variables that are numeric, for each species and year.
# Penguins are fun to summarize!
penguins %>%
group_by(species, year) %>%
summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE))) # passing na.rm through across() is deprecated in dplyr 1.1.0+
#> # A tibble: 9 × 6
#> # Groups: species [3]
#> species year bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 Adelie 2007 38.8 18.8 187. 3696.
#> 2 Adelie 2008 38.6 18.2 191. 3742
#> 3 Adelie 2009 39.0 18.1 192. 3665.
#> 4 Chinstrap 2007 48.7 18.5 192. 3694.
#> 5 Chinstrap 2008 48.7 18.4 198. 3800
#> 6 Chinstrap 2009 49.1 18.3 198. 3725
#> 7 Gentoo 2007 47.0 14.7 215. 5071.
#> 8 Gentoo 2008 46.9 14.9 218. 5020.
#> 9 Gentoo 2009 48.5 15.3 218. 5141.
Use a tidyselect helper function to
select the body mass column and every column measured in millimeters.
penguins %>%
dplyr::select(body_mass_g, ends_with("_mm")) %>%
glimpse()
#> Rows: 344
#> Columns: 4
#> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
Use ggplot() to make a scatterplot of body mass as a
function of flipper length. Display the species information using the
colour and shape of the points.
# Scatterplot example 1: penguin flipper length versus body mass
ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(aes(colour = species,
shape = species),
size = 2) +
scale_color_manual(values = c("darkorange","darkorchid","cyan4"))
Use ggplot() to plot the relationship between body mass
and flipper length, coloured by sex, faceted by species.
ggplot(penguins, aes(x = flipper_length_mm,
y = body_mass_g)) +
geom_point(aes(color = sex)) +
scale_color_manual(values = c("darkorange","cyan4"),
na.translate = FALSE) +
facet_wrap(~species)
Use ggplot() to explore the distribution of bill length
by species (hint: you can use geom_jitter() or
geom_histogram()).
# Jitter plot example: bill length by species
ggplot(data = penguins, aes(x = species, y = bill_length_mm)) +
geom_jitter(aes(color = species),
width = 0.1,
alpha = 0.7,
show.legend = FALSE) +
scale_color_manual(values = c("darkorange","darkorchid","cyan4"))
# Histogram example: flipper length by species
ggplot(data = penguins, aes(x = flipper_length_mm)) +
geom_histogram(aes(fill = species), alpha = 0.5, position = "identity") +
scale_fill_manual(values = c("darkorange","darkorchid","cyan4"))
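The histogram above shows flipper length. To answer the question as posed, here is a sketch of the same plot for bill length. The binwidth of 1 mm and the axis labels are assumptions chosen for this illustration, and na.rm = TRUE simply silences the warning about the two penguins with no bill measurement.

```r
library(ggplot2)
library(palmerpenguins)

# Histogram sketch: bill length by species, with overlapping translucent bars
bill_hist <- ggplot(data = penguins, aes(x = bill_length_mm)) +
  geom_histogram(aes(fill = species),
                 alpha = 0.5,
                 position = "identity",
                 binwidth = 1, # assumed bin width, in mm
                 na.rm = TRUE) +
  scale_fill_manual(values = c("darkorange", "darkorchid", "cyan4")) +
  labs(x = "Bill length (mm)", y = "Count", fill = "Species")

bill_hist
```

Setting binwidth explicitly also avoids the message ggplot2 prints when it falls back to its default of 30 bins.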
Data originally published in:
Individual datasets:
Individual data can be accessed directly via the Environmental Data Initiative:
Palmer Station Antarctica LTER and K. Gorman, 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Adélie penguins (Pygoscelis adeliae) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 5. Environmental Data Initiative. https://doi.org/10.6073/pasta/98b16d7d563f265cb52372c8ca99e60f (Accessed 2020-06-08).
Palmer Station Antarctica LTER and K. Gorman, 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Gentoo penguin (Pygoscelis papua) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 5. Environmental Data Initiative. https://doi.org/10.6073/pasta/7fca67fb28d56ee2ffa3d9370ebda689 (Accessed 2020-06-08).
Palmer Station Antarctica LTER and K. Gorman, 2020. Structural size measurements and isotopic signatures of foraging among adult male and female Chinstrap penguin (Pygoscelis antarcticus) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 6. Environmental Data Initiative. https://doi.org/10.6073/pasta/c14dfcfada8ea13a17536e73eb6fbe9e (Accessed 2020-06-08).